
[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps #2186

Draft
sifakis wants to merge 51 commits into AcademySoftwareFoundation:master from sifakis:vbm-cpu-port

Conversation

@sifakis
Contributor

@sifakis sifakis commented Mar 27, 2026

Draft for CI and diff review. WIP.

@sifakis sifakis changed the title NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps [WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps Mar 30, 2026
sifakis and others added 2 commits April 2, 2026 13:46
Add WenoLeafPtrs<BuildT>, resolveWenoLeafPtrs, and computeWenoStencil as
static __device__ members of VoxelBlockManager<Log2BlockWidth>. These
implement the first phase of a two-function WENO5 stencil gather:
resolveWenoLeafPtrs performs exactly 3 probeLeaf calls (one per axis) to
resolve neighbor leaf pointers; computeWenoStencil fills a caller-provided
array with the 19 global sequential indices using WenoPt<i,j,k>::idx.

voxelOffset arithmetic uses octal notation: NanoVDB leaf layout encodes
(x,y,z) as x*64+y*8+z, so x/y/z strides are 0100/010/1 in octal.
WenoPt<i,j,k>::idx is used throughout to remain independent of any future
re-alignment with OpenVDB's NineteenPt (which uses a different convention).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds the ex_voxelBlockManager_host_cuda example demonstrating the CPU and
CUDA VoxelBlockManager implementations, along with design documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
sifakis and others added 15 commits April 2, 2026 13:50
Remove DecodeInverseMapsCPUPlan.md (implementation complete) and distill
its non-obvious design decisions into the knowledge base:

- §11: decodeInverseMaps is intentionally single-threaded/stateless;
  caller distributes blocks; contrast with cooperative GPU version.
- §12: mPrefixSum is bypassed for bulk access — recomputing from raw mask
  words via buildMaskPrefixSums is cheaper than unpacking 9-bit fields;
  mPrefixSum is still used for the cross-word offset in Step 5.
- §13: output fill is range-fill + contiguous copy (not scatter) because
  shuffleDownMask produces a sorted compacted array; std::fill/copy caveat
  for alignment when output arrays come from TLS or stack pointers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Implementation complete; design rationale distilled into VBMImplementationKnowledge.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Design reference for the per-block stencil gather kernel: decodes inverse
maps into block-local scratch, then resolves neighbor leaf pointers and
fills N-point stencil index arrays for all active voxels in the block.
WENO5 (N=19, R=3) is the motivating instance; architecture is stencil-agnostic.
Covers GPU inner loop, CPU SIMD batch design (SIMDw=16, probeLeaf dedup),
unified StencilLeafPtrs template, and reach-R generalization considerations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- §3: Stencil type as template parameter needs index→offsets direction
  (for-each-slot gather loop), not the offsets→index direction of WenoPt.
  Clarify relationship to BaseStencil/WenoStencil: geometry-only descriptor,
  no accessor coupling.
- §4: Kernel lambda signature std::array<ValueType,K> kernel(const ValueType* u);
  output is homogeneous std::array (not tuple); K=1 degenerates to scalar;
  SoA output layout results[k][BlockWidth] for SIMD efficiency.
- Renumber §4-§8 → §5-§9; update open questions accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…notes

Adds a self-contained test (lift_test.cpp) exploring a generic SIMD-lifting
abstraction: given a scalar tuple→tuple kernel, liftToSimd<W> produces an
SoA-wide version that loops over W lanes and is the auto-vectorization target.

The motivating kernel is WENO5 normSqGrad (19-point stencil, matching
WenoStencil::normSqGrad from Stencils.h).  The six weno5() calls vectorize
cleanly; godunovsNormSqrd() blocks vectorization in two distinct ways
depending on how it is written:

  1. std::max / bool isOutside ternaries → "control flow in loop"
  2. float sign + fmaxf (no ternaries) → "no vectype for stmt" due to
     GCC's inability to see through std::tuple's recursive-inheritance
     struct layout in GIMPLE alias analysis

INVESTIGATION.md documents all experiments, findings, current blockers,
and proposed next steps (pointer-cache approach, Clang comparison, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Evangelos Sifakis <esifakis@gmail.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduce nanovdb::util::Simd<T,W> (simd_test/Simd.h) — a minimal
header-only SIMD abstraction backed by std::array<T,W> with arithmetic
operators, SimdMask, min/max, and where().  Mirrors the C++26 std::simd
interface for forward compatibility.

Rewrite the WENO5 normSqGrad kernel as a template on T:
- T=float  : scalar __hostdev__ path for GPU (one thread per voxel)
- T=Simd<float,W> : W-wide CPU path (one call per batch)

A single templated godunovsNormSqrd + normSqGrad definition serves both
execution contexts with no #ifdef, structurally matching Stencils.h.
Clang 18 vectorizes the Simd<float,16> instantiation (691 ymm instructions
in the hot function, assembly-verified); GCC 13 does not.

Update INVESTIGATION.md with the full scoreboard, both approaches, and
next steps (GCC intrinsics path, benchmarking, nanovdb/util/ integration).

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Auto-detect <experimental/simd> (Parallelism TS v2) via __has_include and
__cpp_lib_experimental_parallel_simd.  When available, Simd<T,W> and
SimdMask<T,W> become thin wrappers around fixed_size_simd / fixed_size_simd_mask,
delegating all arithmetic to the standard type.

The TS v2 where(mask, v) is a 2-arg masked-assignment proxy; wrap it into
the 3-arg select(mask, a, b) form expected by the kernels.

Verified with clang++-18 -std=c++26: both paths produce identical assembly
(1275 ymm instructions, PASS on all 16 lanes), confirming Clang optimizes
through the wrapper completely.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Document the std::experimental::simd backend alongside the std::array
default, including the TS v2 where() adaptation, the auto-detection
mechanism, and the assembly comparison showing byte-for-byte identical
output between the two backends under Clang 18.

Update the vectorization results table and open questions accordingly.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
StencilKernel.h — new prototype header:
  - BaseStencilKernel<T, SIZE>: owns mValues[], mDx2, mInvDx2; no grid coupling
  - WenoStencilKernel<T>: derives from above, provides normSqGrad()
  - WENO5<T> and GodunovsNormSqrd<T, MaskT>: free functions mirroring Stencils.h
  - T=float for GPU scalar path, T=Simd<float,W> for CPU batch path

lift_test.cpp — rewritten to use WenoStencilKernel<T> directly:
  - SIMD and scalar reference paths both instantiate the same class
  - dx passed to constructor; mValues populated via operator[]

Simd.h — refinements:
  - Simd<T,W> and SimdMask<T,W> in Backend A are now pure type aliases for
    stdx::fixed_size_simd / fixed_size_simd_mask (no wrapper struct)
  - element_aligned_tag / element_aligned: portable load/store tag, always
    present; aliases stdx::element_aligned_tag in Backend A, dummy struct in B
  - Backend B load constructor and store() accept element_aligned_tag (defaulted)
  - NANOVDB_NO_STD_SIMD opt-out flag to force Backend B

INVESTIGATION.md — updated:
  - Approach B section updated to reflect class hierarchy instead of free functions
  - Backend B GCC note: the struct-access failure was specific to Approach A's
    liftToSimd outer-lane loop; Backend B's fixed-count operator loops do vectorize
    on GCC when used with the Generic-T class hierarchy
  - New ymm tables for both backends under GCC (Backend A: 1267 total,
    Backend B: 619 total); both pass correctness

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…chPtrs)

Add full CPU batch neighbor-leaf resolution design to the planning doc:

- §6: Replace StencilLeafPtrs struct with layered design — shared 3×3×3 bit
  encoding for probedMask/ptrs[27], stencil-specific batchPtrs population
  (batchPtrs[4][SIMDw] for WENO5, batchPtrs[3][3][3][SIMDw] for box stencil),
  and GPU scalar design note kept separate.

- §8d: Update lazy-probe section to reference ptrs[27] and 27-bit probedMask;
  add batchPtrs population step (Phase 2) after the probeLeaf loop.

- §8e: Update computeNeededDirs direction table to use 3×3×3 bit positions
  (bits 4,10,12,14,16,22 for WENO5 face neighbors).

- §8f/§8g: Minor notation updates to match ptrs[27] naming.

- §9: Resolve ptrs-layout and nExtraLeaves open questions; add prototype scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Standalone CPU-only executable that verifies the neighbor leaf resolution
design from StencilGather.md §8d–§8f:

- For each VBM block: calls decodeInverseMaps, recomputes nLeaves from
  jumpMap, then processes SIMDw=16 batches with the full probedMask /
  lazy-probeLeaf / batchPtrs[4][SIMDw] pipeline.
- Does not call computeStencil.  Instead verifies batchPtrs against a
  direct probeLeaf reference for all 18 non-center WENO5 stencil offsets
  that cross leaf boundaries.
- Passes at 0.1, 0.25, 0.5, 0.9 occupancy (2.3M–2.9M lane checks each).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Refactor computeNeededDirs to accept a pre-expanded Simd<uint32_t,SIMDw>
vector, moving sentinel/masking responsibility to the single gather site
where leafMask is known:

- kSentinelExpanded = expandVoxelOffset(292) = 0x41044104 (constexpr)
- Caller broadcasts sentinel to all lanes, overwrites leafMask lanes with
  real expandVoxelOffset() values before calling computeNeededDirs
- computeNeededDirs is now a pure add+reduce with no masking or cross-check

Carry trick (§8e): expandVoxelOffset packs lz/lx/ly into 6 guarded 3-bit
groups; a single vpaddd ymm × 2 + vpor + vpand + shuffle-tree detects all
six WENO5 directions simultaneously.  kExpandCarryK = 0x514530C3.

AVX2 codegen confirmed via objdump:
- computeNeededDirs: vpbroadcastd + 2×vpaddd ymm + vpor/vpand ymm +
  vextracti128/vpsrldq shuffle-tree, no branches or calls in hot path
- activeMask/leafMask in runPrototype: vpcmpeqd ymm × 4 + vmovmskps ymm × 2
- Sentinel broadcast: 0x41044104 literal → vpbroadcastd → 2×vmovdqa ymm

Always-on scalar cross-check at every computeNeededDirs call site.
verifyComputeNeededDirsSentinel() tests both the sentinel carry property
and the straddle-lane non-pollution scenario before runPrototype().

StencilGather.md §8e and §8f updated to match new API and codegen notes.
Phase 1 prototype marked complete in §9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Documents the BatchAccessor — the SIMD-batch analog of ValueAccessor —
developed from the ex_stencil_gather_cpu Phase 1 prototype discussion.

Core concept: instead of caching the path to one leaf, cache the full
3×3×3 neighborhood of 27 leaf pointers around the current center leaf,
serving SIMDw voxels per call.

Key design elements documented:

Eviction policy: fires on none_of(leafMask) only — straddle lanes do not
evict. leafMask is the "partial-hit" signal with no scalar-accessor analog.

Prefetch coverage argument:
- WENO5 (R=3): 6 extremal taps (±R,0,0),(0,±R,0),(0,0,±R) are necessary
  and sufficient — equivalent to the computeNeededDirs carry trick
- Box stencil (R=1): 8 corner taps (±1,±1,±1) collectively cover all 26
  non-center directions for any voxel position in the batch

Three-tier API:
- prefetch<di,dj,dk>(vo, leafMask, treeAcc)
- cachedGetValue<di,dj,dk>(vo, leafMask)  — no treeAcc, cache assumed warm
- getValue<di,dj,dk>(vo, leafMask, treeAcc) — lazy combined (vanilla style)

Template <di,dj,dk> rationale vs runtime Coord: compile-time direction bit,
dead-axis elimination, VDB convention alignment; runtime Coord overload
provided for generic stencil adapters.

AVX2 profile: offset arithmetic (vpaddd ymm), lane split (vpcmpgtd ymm),
gather from ≤2 leaf pointers (vgatherdps×2 + vpblendvb) — both scalar
bottlenecks from Phase 1 prototype are eliminated.

StencilGather.md: add cross-reference to BatchAccessor.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduces BatchAccessor<BuildT,ValueT,VoxelOffsetT,LeafIDT,PredicateT>,
the SIMD analog of NanoVDB's ValueAccessor.  Instead of caching the path
to one leaf, it caches the 27-entry 3×3×3 neighbor pointer table around
the current center leaf and serves a SIMD batch of LaneWidth voxels per
call.

Key design decisions
--------------------
- Eager center: constructor and advance() populate mLeafNeighbors[13]
  directly (O(1), no probeLeaf), so cachedGetValue<0,0,0> is valid
  immediately and the probe loop never needs a center special-case.

- SWAR neededMask: prefetch<di,dj,dk> expands the 9-bit voxel offsets
  into a 15-bit packed form (lz@[0:2], lx@[6:8], ly@[12:14]) using SIMD
  bitwise ops, then adds a compile-time packed stencil offset and checks
  carry bits for crossing detection.  One vpaddw YMM instruction covers
  all 16 lanes; clang folds packed_d into the blend constants at compile
  time, reducing the expand+blend+add to 5 SIMD instructions.

- Heterogeneous where: Simd.h gains where<T,U,W>(SimdMask<U,W>, ...) so
  a PredicateT=SimdMask<float,W> can gate a VoxelOffsetT=Simd<uint16_t,W>
  blend without explicit casting.  Array backend uses a trivial bool loop;
  stdx backend converts via a bool[] round-trip.

- Correctness verified in-process: stencil_gather_cpu.cpp integrates
  BatchAccessor as an alternate execution path and cross-checks all 18
  WENO5 tap directions against direct tree references (12.3M lane checks).

Simd.h additions (array backend)
---------------------------------
- SimdMask<T,W>: converting constructor from SimdMask<U,W>
- Simd<T,W>: operator|, &, ^, <<(Simd), >>(Simd)
- where<T,U,W>: heterogeneous mask overload (both backends)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Simd.h:
- Add scalar_traits<T> / scalar_traits_t<T>: extracts element type from
  plain scalars (identity) and Simd<T,W> (T); used by BatchAccessor
  static_asserts and as the shift-count type for the new uniform-shift ops.
- Add Simd::operator<<(T) / operator>>(T): uniform scalar shift (all lanes
  by the same immediate).  Maps to vpsllw imm8 / vpsrlw imm8 on x86 —
  distinguished from the existing per-lane Simd<<Simd overload which would
  require the nonexistent vpsllvw instruction for 16-bit lanes.

BatchAccessor::prefetch SWAR section:
- Replace hard-coded uint16_t/uint32_t casts with VoxelOffsetScalarT
  (= scalar_traits_t<VoxelOffsetT>) so the code is correct for any
  unsigned 16+-bit instantiation, not just Simd<uint16_t,W>.
- Add class-level static_asserts (unsigned + sizeof >= 2) with explanatory
  messages referencing the SWAR carry-detection contract.
- Remove static_assert(LaneWidth >= 16): was a performance aspiration,
  not a correctness requirement; SWAR works for any LaneWidth >= 1.
- Use 'auto' for expanded / packed_lc / packed_sum (type already expressed
  by the initializer); keep explicit uint32_t for hor_or/hor_and/s where
  the width is a deliberate semantic choice.
- Replace kMask15 hex literal 0x71C7u with 0b111'000'111'000'111u (binary
  makes the three 3-bit mask fields and three 3-bit gaps visually explicit).
- Use vo << VoxelOffsetScalarT(9) (uniform shift) instead of
  vo << VoxelOffsetT(VoxelOffsetScalarT(9)) (broadcast-then-per-lane).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
@danrbailey
Contributor

I know that this is marked as not for review yet, but I just wanted to comment and say that specifically there is one aspect of this PR which is something I've wanted in OpenVDB for a long time. I brought this up in our TSC meetings again recently. A lot of the time when we use the ValueAccessor, it is not true random access, but stencil access patterns and the fact that the ValueAccessor is only able to store one leaf node means you end up thrashing the cache in the ValueAccessor when accessing different neighboring leaf nodes along leaf boundaries.

I have done a little experimentation in this area and the lazy leaf neighbor pattern that you use here is what I was envisioning, though I was contemplating whether a templated stencil argument (i.e. 6-neighbor vs 27-neighbor) might provide improved performance depending on the use case. I also hadn't settled on a name - NeighborAccessor, StencilAccessor, etc.

I would be very interested in comparing performance of a similar structure for OpenVDB that doesn't include the SIMD logic to compare with performance of the ValueAccessor (that also does not use SIMD). If it outperforms the ValueAccessor, I think we should look to switch some of our neighbor use cases across.

The other big issue with the ValueAccessor IMO is the tight coupling with the Grid and Tree (getValueAndCache() etc). This hurts compile times, and a new acceleration structure that extends the ValueAccessor in a new direction might be a good opportunity to revisit this and the attach/release accessor mechanism. I know NanoVDB took a slightly different approach on these trade-offs.

(@kmuseth - a good topic for a future TSC meeting I think)

sifakis and others added 9 commits April 16, 2026 10:33
…lingZeros

- Add 2-argument where(mask, target) = value proxy (stdx and array backends):
  stdx-style masked assignment; encourages GCC to emit vpblendvb
- Add util::reduce(v, op) to both backends: tree-reduces to a scalar with
  std::bit_or<>{} / std::bit_and<>{} etc.; replaces scalar horizontal loop
- Add scalar reduce(T, BinaryOp) identity overload for W=1 path
- Add util::countTrailingZeros(uint32_t) to nanovdb/util/Util.h: __hostdev__,
  CUDA/HIP/__builtin_ctz/MSVC/De Bruijn dispatch; removes ad-hoc ctz from Simd.h
- BatchAccessor: use 2-arg where for packed_lc blend, util::reduce for hor_or/and,
  util::countTrailingZeros in toProbe loop, hoist root ref out of loop body

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Replace the per-lane scalar loop in cachedGetValue with a fully SIMD
gather chain that populates offsets, prefixSums, and maskWords without
any scalar iteration over lanes.

Pipeline (all in Simd<T,W>):
  packed_sum (uint16_t)
    → ×1129 → >>10 → &31        : d_vec (0..26), stays uint16_t —
                                   bits [10:14] of product lie below bit 16
                                   so modular uint16_t multiply is exact
    → gather(mNeighborLeafIDs)   : leaf_id_vec (uint32_t)
    → ×kStride                   : raw_idx (int32_t, null lanes → 0)
    → gather(offset_base)        : mOffset per lane
    → gather(prefix_base)        : mPrefixSum packed, then shift-extract field w
    → gather(mask_word_base + w) : valueMask().words()[w] per lane

Switch mLeafNeighbors[27] (const LeafT*) to mNeighborLeafIDs[27] (uint32_t)
with kNullLeafID = ~uint32_t(0) sentinel, enabling the flat-base SIMD gather
pattern.  prefetch and advance updated accordingly.

Add simd_cast<DstT>(Simd<SrcT,W>) to Simd.h for widening (uint16_t → int32_t,
uint16_t → uint64_t, uint32_t → int32_t) used in gather index construction.

Debug cross-check (#ifndef NDEBUG) validates all three vectors against the
scalar reference path; 12M+ lane checks pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
§8a was stale ("prefetch uses a scalar loop") — prefetch has no per-lane
loop. Update with accurate description of its two-phase structure:

- SWAR expansion + vpblendvb + vpaddw: fully in YMM across all LaneWidth
  lanes (vpsllw, vpor, vpand, vpblendvb, vpaddw)
- Horizontal reduction: unavoidable vextracti128 + vpand/vpor tree to
  produce scalar hor_and / hor_or for the per-axis crossing decision

Include the actual Release assembly (ex_stencil_gather_cpu, -O3 -mavx2)
confirming the YMM path survived the mLeafNeighbors → mNeighborLeafIDs
encoding change.

§8c: update nullptr sentinel language to reflect kNullLeafID / valid_u32
mask in the SIMD gather chain.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- Promote SWAR encoding literals to class-scope uint16_t constants:
  kSwarXZMask (0x1C07), kSwarYMask (0x00E0), kSwarSentinel (4|4<<5|4<<10).
  Shared between prefetch and cachedGetValue; implicit conversion to
  VoxelOffsetT at Simd construction time.
- Add direction-extraction local constants in cachedGetValue:
  kSwarCarryMask (0x6318), kDirMul (1129), kDirMask (31).
- Rename w_vec -> wordIndex for clarity.
- Move U64Traits alias inside #ifndef NDEBUG where it is only used.
- Replace non-ASCII characters (em dashes, arrows, element-of, comparison
  operators) with ASCII equivalents throughout.
- Add explicit parentheses around the d_u16 shift-then-mask chain to make
  operator precedence unambiguous.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…liases

Simd.h:
- Add scalar simd_cast<DstT>(SrcT) overload (degrades to static_cast).
- Add scalar 2-arg where(bool, T&) masked-assignment proxy matching the
  SIMD WhereExpression form.
- Add scalar gather(const T*, int32_t) and gather_if(T&, bool, ...) to
  complete the scalar overload set alongside the existing where(bool,T,T).

BatchAccessor.h:
- Remove LeafIDT template parameter (was reserved/unused; now derived).
- Add private class-scope LeafIDVecT and LeafDataVecT using conditional_t:
  plain uint32_t/uint64_t when LaneWidth==1, Simd<T,W> otherwise.
  This upholds the convention that scalar instantiations use underlying
  types directly rather than Simd<T,1> wrappers.
- Replace local U32T/U64T aliases in cachedGetValue with class-scope names.

stencil_gather_cpu.cpp:
- Drop the now-removed LeafIDT argument from the BAccT instantiation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
cachedGetValue is now fully vectorised end-to-end for ValueOnIndex grids.
The scalar leaf->getValue(offset) loop is removed; result is filled via a
2-arg where(isActive, result) = ... directly on the output argument so that
leafMask-clear and inactive-voxel lanes are never touched.

Key design decisions recorded in BatchAccessor.md §8e:

- tapLeafOffset_i64 widened to int64_t before *= kStride to avoid uint32_t
  overflow (kNullLeafID = 0xFFFFFFFF causes wild gather indices in uint32_t).
  simd_cast_if<int64_t>(dst, valid_u32, src) writes 0 for invalid lanes,
  keeping gather indices non-negative for vpgatherqq (signed int64_t).

- gather_if gains a MaskElemT template parameter to support heterogeneous
  masks: valid_u32 (SimdMask<uint32_t,W>) applied to uint64_t data fields.

- Activity check: ValueOnIndex::getValue returns 0 for inactive voxels.
  Detected as isActive = (maskWords & (1<<dest_yz)) != 0. Null-leaf lanes
  have maskWords=0 and are therefore implicitly handled by the same check.

- popcount uses a SWAR shift-and-add tree (popcount64 in Simd.h) rather than
  __builtin_popcountll: AVX2 has no 64-bit lane-wise popcount (VPOPCNTQ is
  AVX-512DQ only), and the scalar popcnt instruction is not vectorisable.

- mOffsetBase, mPrefixBase, mMaskWordBase promoted to class-level const
  pointers, computed once in the constructor, shared across all 18
  cachedGetValue instantiations in a WENO5 stencil gather.

Verified: debug build (-O0) + release build (-O3), 12 321 275 lane checks
across all 18 WENO5 non-center taps, zero mismatches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Add §8f with assembly comparison between GCC 13 and Clang 18
(-O3 -DNDEBUG -march=native, cachedGetValue<1,0,0>, W=16):

- GCC does not inline any Simd.h helper (gather_if, simd_cast, where,
  popcount); emits 14 out-of-line calls and 13 vzeroupper transitions.
  gather_if body uses scalar vmovq/vpinsrq/vinserti128 — no hardware gathers.

- Clang inlines everything except popcount; emits 2 vpgatherdd + 12 vpgatherqq
  hardware gather instructions and only 2 vzeroupper transitions.
  popcount body (88 ymm instrs, pure SWAR vpsrlq/vpand/vpaddq) is also
  fully vectorized but remains out-of-line.

- 43 vpinsrb in Clang output are mask-widening cost for heterogeneous
  gather_if (SimdMask<uint32_t,16> → 4x SimdMask<uint64_t,4>).

Action item added to §10: [[gnu::always_inline]] on Simd.h helpers
would eliminate the GCC regression and fold popcount inline in both
compilers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…adeoffs

Expand §8f with analysis of three popcount strategies for the 16-lane
uint64_t case:

- SWAR shift-and-add tree (current): 88 instructions, all AVX2-friendly
  but port-intensive; stays in vector registers throughout.

- Scalar popcnt with extract/reassemble: ~56 instructions; popcnt is
  pipelined (1/cycle throughput on port 1, 3-cycle latency) so 16
  independent lanes retire in ~16 cycles, but the ymm↔GPR domain
  crossing adds ~2 cycles bypass latency per extraction and port 1
  serialises all 16 popcnts.

- vpshufb nibble-LUT + vpsadbw (recommended): ~40 instructions, no
  domain crossing, uses ports 0/5 and 5 (orthogonal to SWAR ports),
  standard compiler-generated AVX2 popcount pattern. Added as action
  item in §10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds StencilAccessor<BuildT,W,StencilT> — a compile-time-parameterized SIMD
wrapper around BatchAccessor that owns the straddling loop, hull-prefetch
sequencing, and per-tap cachedGetValue blending for one VBM block.  Output
is mIndices[SIZE] of Simd<uint64_t,W>, one vector per stencil tap.

Includes Weno5Stencil (18 taps, 6-tap hull) and the findIndex constexpr
fold for compile-time getValue<DI,DJ,DK>() inverse-map lookup.

Also:
- BatchAccessor.h: add centerLeafID() getter used by StencilAccessor
- BatchAccessor.md: expand §8f assembly matrix (compiler × backend × ISA)
- stencil_gather_cpu.cpp: wire StencilAccessor into runPrototype with
  verifyStencilAccessor correctness checks; add rdtsc-based runPerf
- Simd.h: remove = {} default argument from element_aligned_tag overloads
- CLAUDE.md: add project build/architecture reference for Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
sifakis and others added 25 commits April 18, 2026 12:11
…util/

All three headers are now shared dependencies across multiple examples and
are no longer specific to a single example directory.  Consolidate into
nanovdb/nanovdb/util/ to match the existing NanoVDB utility layout.

- simd_test/Simd.h → nanovdb/util/Simd.h
- ex_voxelBlockManager_host_cuda/BatchAccessor.h → nanovdb/util/BatchAccessor.h
- ex_voxelBlockManager_host_cuda/StencilAccessor.h → nanovdb/util/StencilAccessor.h

Update all #include paths in BatchAccessor.h, StencilAccessor.h,
stencil_gather_cpu.cpp, and simd_test scratchpad files.

Also add HaloStencilAccessor.md: design document for the dense halo buffer
approach to CPU WENO5 stencil extraction (successor to StencilAccessor).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…sor.md to nanovdb/util/

Keep design docs co-located with their headers now that Simd.h,
BatchAccessor.h, and StencilAccessor.h live in nanovdb/nanovdb/util/.

VBM-specific docs (StencilGather.md, VBMImplementationKnowledge.md,
VoxelBlockManagerContext.md) remain in ex_voxelBlockManager_host_cuda/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
INVESTIGATION.md, StencilKernel.h, and lift_test.cpp were investigation
artifacts from the superseded liftToSimd / generic-T approach that preceded
BatchAccessor and Simd.h.  Relevant findings are preserved in
nanovdb/util/BatchAccessor.md.  Untracked scratch files (assembly outputs,
binaries, codegen experiments) deleted alongside.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
The inline prefix-scan / SWAR-popcount investigation block (lines 374-438)
was an algorithm-exploration artifact that reimplemented decodeInverseMaps
internals rather than testing the public API.  Removes the block along with
the now-unused popcount32 helper and the immintrin.h include.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- New LegacyStencilAccessor.h: scalar ReadAccessor-based reference,
  templated on the same StencilT as StencilAccessor.  Serves as a
  correctness oracle and performance baseline (mirrors the accessor
  path-eviction behaviour of OpenVDB's math/Stencils.h).

- ex_stencil_gather_cpu: rewrite runPerf to multi-thread both paths
  via util::forEach, with an XOR-checksum anti-DCE, nanovdb::util::Timer
  wall-clock timing, and a decodeInverseMaps-only baseline pass.  The
  decode step is ~3 ms of the ~127 ms end-to-end time (~2%), so the
  SIMD-vs-scalar comparison is dominated by the stencil-gather itself.

- examples/CMakeLists.txt: bump ex_stencil_gather_cpu flags from
  -mavx2 to -march=native.

- BatchAccessor.md §8h: full end-to-end codegen investigation on
  i9-285K Arrow Lake, 32 threads, 16.7 M active voxels.  Default GCC 13
  -O3 outlines 14 Simd.h helpers per cachedGetValue and outlines the
  18 per-tap calls from calcTaps, producing ~323 calls and ~282
  vzeroupper per 16-voxel batch — StencilAccessor runs at 7.5 ns/vox,
  losing to scalar LegacyStencilAccessor at 5.4 ns/vox.
  [[gnu::flatten]] on StencilAccessor::moveTo collapses the entire
  call tree into a single 77 KB inlined body: GCC drops to 3.7 ns/vox
  (2x), beating Clang's 4.3 ns/vox.  Flattening only
  BatchAccessor::{prefetch,cachedGetValue} is insufficient (4.9 ns/vox).
  W=8 cuts YMM spills by 86% but regresses GCC end-to-end because
  per-batch framing overhead amortises over fewer lanes.  Attributes
  not applied in the shipped headers — opt-in for StencilAccessor
  consumers that need peak GCC performance.

- StencilAccessor.md §8.1: short GCC codegen note pointing to §8h.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…af / tree-walk)

Decomposes LegacyStencilAccessor's 5.4 ns/voxel end-to-end cost into
three components by running two additional timed passes before the
full Legacy pass:

- framing-only: full Legacy loop structure with no accessor call, using
  the center Coord components as the anti-DCE sum seed.  Measures the
  decodeInverseMaps + Coord-compute + 18-iteration inner loop + store.

- center-hit x 18: calls mAcc.getValue() 18 times per voxel on distinct
  coords that all lie within the center voxel's leaf (an 8x2x1 slab
  parameterised by k in [0..17]).  Every call is a guaranteed leaf-cache
  hit, so there is no tree walk; the distinct coords prevent CSE.
  Measures framing + accessor cache-check + leaf-local lookup.

- full: existing LegacyStencilAccessor path, for reference.

Subtracting gives:
  cache-check + leaf-local lookup = center-hit - framing
  tree-walk cost (amortised over misses) = full - center-hit

Measured breakdown (i9-285K Arrow Lake, 32 threads, 16.7 M active voxels):
  framing       : ~0.25 ns/vox  (5%)
  cache+leaf    : ~0.94 ns/vox  (17%, 0.05 ns/tap — fully load-port pipelined)
  tree walk     : ~4.24 ns/vox  (78%, serialised adjacent-leaf re-descents)

Finding: tree walks on leaf-cache miss dominate LegacyStencilAccessor's
cost despite affecting only ~25% of taps.  Each miss is ~4.5 cycles
(single lower-node re-descent), but they chain serially — unlike the
leaf-local lookups which the OoO engine fully pipelines through the
three load ports.  This is the cost that a 27-leaf neighbour cache
(BatchAccessor-style) eliminates.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Replaces BatchAccessor's fully-SIMD 18-step gather chain (§8e) with a
hybrid design that keeps the naturally-SIMD work vectorised and hands
the per-tap leaf lookup to a scalar loop.  Motivated by the end-to-end
codegen analysis in BatchAccessor.md §8f / §8h:

- Without [[gnu::flatten]], GCC outlined the 14 Simd.h helpers inside
  cachedGetValue and the 18 per-tap calls from calcTaps, paying ~323
  function calls and ~282 vzeroupper per 16-voxel batch and losing to
  scalar LegacyStencilAccessor (7.5 vs 5.4 ns/voxel).
- With flatten, GCC ran at 3.7 ns/voxel but as a single 77 KB monolithic
  body per StencilAccessor instantiation — strong dependency on a
  compiler-specific attribute for competitive perf.

Hybrid split (BatchAccessor.md §8i):

Stays SIMD (native __m256i uint16 throughout — no _Fixed<W> aggregate
ABI, no gathers, no heterogeneous where-blends):
- prefetch<di,dj,dk>() — unchanged.
- Setup in cachedGetValue: SWAR expansion, packed_sum, base-32 direction
  extraction (d_u16), packed-layout -> 9-bit local offset extraction
  (localOffset_u16).

Goes scalar (via two util::store harvests and a bitmask leafMask):
- Per-lane pointer chase into mNeighborLeafIDs + mFirstLeaf.
- One leaf.getValue(offset) call per active lane — the LeafNode handles
  valueMask / mPrefixSum / popcount internally.

Public API changes:

BatchAccessor::cachedGetValue signature:
- before: void cachedGetValue(ValueT& result, VoxelOffsetT vo, PredicateT mask)
- after:  void cachedGetValue(ScalarValueT (&dst)[LaneWidth], VoxelOffsetT vo,
                              PredicateT mask)
The C-array output allows the scalar tail to write lane results with a
single mov, eliminating the 18× WhereExpression::operator= outlined body.

StencilAccessor:
- Storage: Simd<uint64_t,W> mIndices[SIZE] -> uint64_t mIndices[SIZE][W],
  now public.  Layout is part of the ABI.
- moveTo() returns void (active-lane info is leafIndex[i] != UnusedLeafIndex,
  already available to the caller — no duplicate mask).
- Removed getValue<DI,DJ,DK>() and operator[]; added a static constexpr
  tapIndex<DI,DJ,DK>() for reorder-safe compile-time named-tap access.
- Public API now contains zero Simd<>/SimdMask<> types.  Callers SIMD-load
  tap rows from mIndices[k] with whatever backend they choose, or iterate
  scalarly — we don't impose a choice.

Simd.h additions:
- util::store(v, p) — uniform store shim dispatching to stdx::copy_to on
  the stdx backend and to Simd::store on the array backend.

BatchAccessor.h additions:
- mFirstLeaf member (cached getFirstLeaf() base) for the scalar tail
  leaf lookup.

End-to-end perf (32 M ambient / 50% / 32 threads, i9-285K Arrow Lake):

  Variant                            GCC ns/vox   Clang ns/vox
  Old SIMD path, no flatten                 7.5            4.3
  Old SIMD path, +flatten on moveTo         3.7            4.3
  Hybrid (this commit), no flatten          5.1            4.9
  Hybrid +flatten on moveTo                 4.8            4.8
  LegacyStencilAccessor                     5.5            6.7

Without flatten, the hybrid is 31% faster than the old SIMD path on GCC
and within 0.2 ns/voxel on Clang, eliminating the 3x compiler spread
that §8f / §8h documented and making both compilers beat the scalar
LegacyStencilAccessor oracle.

The residual 1.4 ns/voxel gap on GCC (hybrid 5.1 vs SIMD+flatten 3.7)
is recoverable by re-applying flatten at the caller site — but the
shipped code no longer requires it for acceptable performance.

Correctness: 12M lane-checks across all 18 WENO5 taps pass against the
LegacyStencilAccessor oracle; XOR checksum matches across the full
16.7 M active voxel workload.

Docs:
- BatchAccessor.md §8e marked "historical"; §8i added with the hybrid
  design, perf matrix, and cleanup notes (gather/simd_cast/popcount
  helpers in Simd.h are no longer exercised and can be removed in a
  follow-up).
- StencilAccessor.md §8.1 rewritten to describe the Simd-free public
  API and the new caller pattern.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…lue: decomposition infra

Adds benchmarking infrastructure used to decompose the stencil-gather
pipeline into its real cost components via perf PMU counters.  None of
this ships in the default code path; it lives alongside the hybrid design
to support follow-up perf analysis and to preserve decomposition variants
that might be reused later.

BatchAccessor:
- Add mFirstLeaf cached base pointer (grid.tree().getFirstLeaf()) so the
  scalar tail in cachedGetValue can do leaf lookups without going back
  through mGrid.tree() each time.
- Add cachedGetValueInLeaf<di,dj,dk>: benchmarking-only variant of
  cachedGetValue that wraps taps to the center leaf via a mod-8 SWAR
  mask (kSwarFieldMask = 0x1CE7).  Used to isolate the hybrid's
  single-leaf floor cost with no cross-leaf variety.  Requires (di,dj,dk)
  in [0,7] per axis.

StencilAccessor:
- Add moveToInLeaf: counterpart to moveTo that calls cachedGetValueInLeaf
  instead of cachedGetValue.  prefetchHull is skipped (center leaf is
  always cached after construction/advance).  Benchmarking only — results
  have no geometric meaning.

ex_stencil_gather_cpu:
- Add CLI flags:
    --pass=<name>       run one pass in isolation.  Supported names:
                        framing, decode, center-hit, stencil, degenerate,
                        inleaf, legacy, legacy-branchless, all (default),
                        verify.  Needed for clean perf-stat attribution —
                        the default harness runs every variant back-to-back
                        and perf cannot restrict counters to a subrange.
    --threads=<n>       gate TBB parallelism via tbb::global_control.
                        Single-threaded runs give cleaner perf attribution;
                        taskset alone does not limit TBB's worker-thread
                        count.
- Add DegenerateStencil: 18 identical (0,0,0) taps.  Under CSE the
  per-tap work collapses to 1 evaluation per lane — isolates the hybrid's
  pipeline framing cost when the compiler can factor out all tap work.
- Add InLeafStencil: 18 distinct compile-time taps in [0,6] per axis.
  Combined with moveToInLeaf, exercises the full hybrid pipeline with
  18 distinct per-tap work items that all hit the center leaf — isolates
  the "same-leaf no-CSE" floor.
- Add four new timed passes (with anti-DCE checksums):
    degenerateUs            DegAccT .moveTo()
    inLeafUs                InLeafAccT.moveToInLeaf()
    framingUs               Legacy loop with no accessor call
    centerHitUs             Legacy loop with 18 distinct same-leaf coords
                            (CSE-defeated)
    legacyBranchlessUs      Legacy loop with leaf.getValue() replaced by
                            the unconditional formula
                            mOffset + prefix + popcount(maskWord & mask-1)
                            — drops the valueMask.isOn branch, giving a
                            3x speedup.
- decodeUs baseline already existed; kept.
- Each pass is guarded by wantPass(name) so --pass=<name> runs exactly
  one variant in isolation.
- Output extended: per-pass ns/voxel lines + a Legacy cost decomposition
  block (framing / cache+leaf / tree walk).

Architectural summary: the decomposition revealed that the true CPU
bottleneck is unpredictable branch mispredicts on the valueMask.isOn()
check inside LeafNode<ValueOnIndex>::getValue(offset), not L1 pressure
or tree-walk latency.  Detailed perf matrix, revised attribution, and
follow-up implications are documented in BatchAccessor.md §8j and
StencilAccessor.md §8.2 (separate commit).

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…d attribution

Captures the findings of the perf-counter investigation made possible by
the decomposition infrastructure in the prior commit.  Preserves prior
claims as history (with targeted "Revision note" annotations pointing to
the superseding section), adds the new authoritative measurements and
decomposition, and updates the current-API sections to match the shipped
hybrid design.

BatchAccessor.md:

- §8g, §8h, §8i: prepend "Revision note" blocks pointing to §8j.  The
  original text is preserved intact — only a short header note flags
  which claims were subsequently refined.  Affected claims:
    * §8g cycle budget model: gather chain assumed to dominate; actually
      pipelined away by OoO, real dominant cost is branch mispredicts.
    * §8h "multi-leaf L1 pressure accounts for cross-leaf overhead":
      refuted, L1 miss rate is flat across all variants.
    * §8i "27-leaf cache eliminates tree walks" magnitude overstated;
      measured savings are ~0.3 ns/voxel, not ~3-4 ns/voxel.

- §8j (new, ~230 lines): full perf-counter chronicle.  Nine subsections:
    8j.1 Motivation
    8j.2 Methodology (--pass/--threads flags, PMU setup, P-core pinning)
    8j.3 Measurement matrix for 6 variants (IPC, branch-miss rate,
         L1 miss rate, ns/voxel on a single P-core single-threaded)
    8j.4 Identifying valueMask.isOn(offset) as the real source
    8j.5 Branchless experiment (legacy-branchless pass) — 3x speedup
         on Legacy from 5.6 to 2.0 ns/voxel at 32 threads
    8j.6 Revised 5.4 ns/voxel attribution: ~3.6 isOn mispredicts,
         ~0.75 leaf-local work, ~0.3 tree-walk diff, ~0 L1 pressure
    8j.7 NANOVDB_USE_INTRINSICS is a no-op on GCC 13 at -O3 -march=native
         (SWAR pattern-matched to popcnt) — enable for portability
    8j.8 Architectural implications (branchless LeafNode::getValue
         proposal, revised halo value proposition, hybrid rationale)
    8j.9 Historical correction log — single-table index of revised
         claims with their source sections

- §10 Status: add the hybrid refactor and §8j findings to Completed;
  add "branchless LeafNode<ValueOnIndex>::getValue" as the top Remaining
  item — the single biggest cheap CPU-side speedup available.

StencilAccessor.md:

- §8.2 (new): summary of the PMU finding pointing to BatchAccessor.md
  §8j.  Contains the headline measurement matrix, branchless-experiment
  delta, revised 5.4 ns/voxel attribution, and consequences for design
  decisions.

- §9: rewrite to document the current API (tapIndex<DI,DJ,DK>() +
  public mIndices[SIZE][W]).  Prior getValue<>()/operator[] API was
  removed in the hybrid refactor; that evolution is noted in-section.

- §10 Caller-side usage pattern: rewrite to match current API — moveTo
  returns void, active-lane check via leafIndex[i] != UnusedLeafIndex,
  three direct mIndices access idioms (scalar iteration, SIMD row load,
  compile-time named-tap access).

- §11 Ownership summary: mIndices[SIZE] -> [SIZE][W]; tap-fold row
  updated to reflect direct writes (no where-blend).

- §12 Design decisions 1 and 3: preserve original rationale, append
  "Evolution" / "Revised" notes explaining the shift to void moveTo
  return (decision 1) and removal of operator[] (decision 3) in favour
  of public mIndices + tapIndex<>().  Decisions 2, 4-6 unchanged.

Net: +550 lines of documentation capturing the investigation as a
chronicle, with every revised claim traceable to both the superseded
prior section and the measurement that refined it.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…ly ReadAccessor

Adds a branchless sibling to LeafData<ValueOnIndex>::getValue, and switches
the CPU stencil-gather test scaffolding to use a leaf-only ReadAccessor
variant.  Motivated by the PMU-counter investigation in BatchAccessor.md
§8j: the single data-dependent branch inside LeafData::getValue --
  if (!(word & bit)) return 0;
-- is the dominant cost for hot stencil loops over OnIndex trees, driving
branch-mispredict rates to 8-10% on random-access workloads (IPC ~2) and
still to ~1.7% on spatially-coherent narrow-band workloads (IPC ~4.2
vs a potential ~5.5).

NanoVDB.h: LeafData<ValueOnIndex>::getValueBranchless(uint32_t i)

  Same semantics as getValue -- returns 0 for inactive voxels, otherwise
  mOffset + prefix9(n) + popcount(w & (bit - 1u)) -- but replaces the
  early-return guard with an arithmetic mask gate:
    uint64_t mask = (w & bit) ? ~0ull : 0ull;  return mask & sum;
  The ternary over two constant arms compiles to test + cmov on x86, not
  a conditional jump.  Verified on GCC 13 at -O3 -march=native.

  Semantics are bit-for-bit identical to getValue (checksums match across
  both synthetic random-50% and real narrow-band workloads including OFF
  voxels).

  Scoped to LeafData<ValueOnIndex> (not LeafNode) as an opt-in expert
  path for neighbourhood-aware hot loops (BatchAccessor-class cachers,
  HaloStencilAccessor-style precomputation).  The generic LeafNode and
  ReadAccessor APIs are unchanged.

LegacyStencilAccessor: use ReadAccessor<BuildT, 0, -1, -1>

  For GetValue workloads the upper/lower cache slots of DefaultReadAccessor
  are never consulted (OpT::LEVEL == 0 falls straight to mRoot on a leaf
  cache miss; NanoVDB.h:5387) and are only passively written during the
  walk.  Switching to a 1-level leaf-only cache removes that passive
  bookkeeping.

  Measured effect: Legacy path sees no change (its pipeline is already
  stalled on branch mispredicts, extra stores overlap for free).  The
  branchless path consistently improves by 4-9% -- when IPC is near
  peak, every retired instruction counts.

ex_stencil_gather_cpu updates
  - legacy-branchless pass now calls leaf->data()->getValueBranchless
    instead of the hand-inlined formula it used during investigation.
    The checksum now matches LegacyStencilAccessor exactly (the hand-
    inlined version wrote non-zero garbage for OFF voxels); the new
    method is semantically a drop-in replacement.
  - center-hit and legacy-branchless passes both use the leaf-only
    ReadAccessor construction.

Measured speedup (32 M ambient / 50% random occupancy / i9-285K, 24
cores wall clock):
  Legacy:              94.5 ms
  Legacy branchless:   34.3 ms   (2.8x)

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…input

Counterpart to ex_stencil_gather_cpu, which uses a procedurally generated
random-occupancy domain (a pathological worst case for branch prediction
inside LeafData::getValue).  This example loads an openvdb level-set
FloatGrid from disk, converts it to a NanoVDB ValueOnIndex topology grid
plus a separately-allocated float sidecar, and runs the same perf-
decomposition battery on a workload with realistic spatial coherence.

Purpose: quantify how much of the isOn-branch-mispredict cost (see
BatchAccessor.md §8j) survives on real narrow-band traversals where
consecutive tap queries mostly land on active voxels.  Findings to be
documented in the follow-on commit.

Pipeline:
  openvdb::io::File(path)                                 -- disk load
  -> openvdb::FloatGrid                                   -- first FloatGrid,
                                                             or --grid=<name>
  -> nanovdb::tools::CreateNanoGrid<openvdb::FloatGrid>   -- builder
       .getHandle<ValueOnIndex, HostBuffer>(channels=0)   -- topology only
       .copyValues<ValueOnIndex>(sidecar.data())          -- float sidecar
  -> validateSidecarOrdering()                            -- 1000-voxel
                                                             sanity check
  -> runPrototype + runPerf (identical to ex_stencil_gather_cpu)

The separate-sidecar path avoids the double conversion that would
otherwise be needed to opt out of the grid's blind data.  Sidecar ordering is
verified at startup by comparing floatGrid.getValue(ijk) against
sidecar[indexGrid.tree().getValue(ijk)] on 1000 random active voxels
(opt-out via --skip-validation).

The sidecar is plumbed through but not yet consumed by any stencil path
-- it exists for future "fetch values via the sidecar" work.

CLI: narrowband_stencil_cpu <path.vdb>
                            [--grid=<name>]
                            [--pass=<name>]       // same set as
                                                 //   ex_stencil_gather_cpu
                            [--threads=<n>]
                            [--skip-validation]

Both the legacy-branchless pass (calls LeafData::getValueBranchless,
introduced in the prior commit) and the center-hit / legacy-branchless
accessor construction (leaf-only ReadAccessor<BuildT, 0, -1, -1>) match
the updated ex_stencil_gather_cpu conventions.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…nd findings, leaf-only accessor

Follow-on chronicle for the investigation in §8j.  Three things happened
since: (1) the hand-inlined branchless variant was promoted into
NanoVDB.h proper as LeafData<ValueOnIndex>::getValueBranchless; (2) a
real narrow-band benchmark (ex_narrowband_stencil_cpu, taperLER.vdb) was
added to validate the finding on a workload with realistic spatial
coherence; (3) we noticed the scaffolding was using a 3-level ReadAccessor
where only the leaf-level cache can contribute and switched to
ReadAccessor<BuildT, 0, -1, -1>.

BatchAccessor.md §8k (new, ~170 lines) — the new findings:

- §8k.1 The new API: getValueBranchless code, semantics (identical to
  getValue; checksum matches byte-for-byte), scope decision (LeafData
  only, not LeafNode/ReadAccessor), note that the shipped version
  includes the OFF-returns-0 mask gate that the hand-inlined
  benchmarking variant omitted.
- §8k.2 The new ex_narrowband_stencil_cpu benchmark — .vdb loading,
  topology-only IndexGrid + separately-allocated float sidecar path,
  startup sidecar validation.
- §8k.3 Measurement matrix (single-P-core, PMU counters): Legacy vs
  getValueBranchless on synthetic vs narrow-band.  Refines §8j:
  narrow-band is NOT pathological for branch prediction (1.74% miss
  rate, IPC 4.22); random-access is (8.07% miss rate, IPC 1.96).
  getValueBranchless recovers near-peak IPC on both.
- §8k.4 Accessor cache-level finding: 3-level ReadAccessor's upper and
  lower cache slots are never consulted for GetValue.  Switching to a
  1-level leaf-only accessor gives 4-9% additional speedup on branchless
  paths (no measurable effect on Legacy, which is backend-bound on
  mispredicts).  Scope caveat: benchmark-only; the library default stays
  at 3-level for mixed workloads.
- §8k.5 Updated end-to-end headline numbers: 24-core Arrow Lake,
  narrow-band 85 → 60 ms (1.4×), synthetic 95 → 34 ms (2.8×).
- §8k.6 What the §10 Remaining list should now show as complete, and
  future follow-ons (ProbeValue variant that reuses getValueBranchless,
  steering-team proposal).

BatchAccessor.md §10 "Completed" updated with three new bullets:
- getValueBranchless shipped (§8k)
- ex_narrowband_stencil_cpu shipped (§8k.2)
- Leaf-only accessor switch in benchmark scaffolding (§8k.4)
Removed the "branchless LeafNode::getValue" item from "Remaining" (done,
at LeafData level per scope decision).

StencilAccessor.md §8.2 updated: the "cheap architectural win" bullet
now points at the shipped getValueBranchless rather than the "would be"
drafting.  The HaloStencilAccessor bullet reframed: its remaining
advantage is "zero per-tap work at query time" now that the mispredict
storm is already addressable by getValueBranchless.  Cross-references
§8j and §8k.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Fold the branchless reformulation (previously shipped as a sibling
getValueBranchless in 8a24ddf) into getValue as its default body, and
gate the pre-2026 branchy form behind NANOVDB_USE_BRANCHY_GETVALUE.
The branchless form is strictly faster or within ~0.1 ns/vox on every
workload measured, and the OFF path is preserved bit-for-bit via the
ternary-constant mask-AND -- checksums match.

Update the two CPU stencil-gather examples to call the single getValue
entry point, and refresh BatchAccessor.md §8k + StencilAccessor.md §8.2
to describe the toggle-based final API and the API evolution history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…aths

The 8-pass decomposition battery (decode, stencil, degenerate, inleaf,
framing, center-hit, legacy, legacy-branchless) was load-bearing while
attributing Legacy's cost between tree-walks, leaf-local work, and
isOn mispredicts (BatchAccessor.md §8g-§8k).  That attribution is now
settled and getValue is branchless by default, so the diagnostic
variants no longer earn their keep.

Retain four passes in both ex_stencil_gather_cpu and
ex_narrowband_stencil_cpu:
  - decode    : decodeInverseMaps cost baseline
  - framing   : loop + coord compute + anti-DCE, no accessor call
  - stencil   : hybrid StencilAccessor (SIMD cache + scalar getValue tail)
  - legacy    : LegacyStencilAccessor (per-tap probeLeaf + getValue)

Drop: DegenerateStencil / InLeafStencil types and their accessor aliases,
the degenerate/inleaf/center-hit/legacy-branchless blocks, and the
"Legacy cost decomposition" printout that consumed center-hit.

Correctness unaffected (StencilAccessor vs LegacyStencilAccessor
cross-validation still runs and passes); checksums match between the
two shipped paths on both benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Added a tap-outer, voxel-inner variant of the Legacy path (same
leaf-only ReadAccessor, same probeLeaf + getValue mechanics, just the
nested loops swapped) as a new `legacy-transposed` benchmark pass in
both examples.  Checksums match the voxel-outer `legacy` pass byte-
for-byte on both synthetic and narrow-band workloads.

During the experiment we hit a GCC inlining pitfall: a runtime-args
inner lambda `[&](int di, int dj, int dk)` invoked 18 times via
parameter-pack fold did *not* get inlined (lambda body contains a
128-iteration loop; GCC's inline budget × 18 is exhausted).  Result:
18 explicit call instructions to a 542-byte processTap function with
6-register prologue/epilogue per call, plus tap offsets becoming
runtime register arguments (one spilled to stack) — accounting for
~13 % of the observed slowdown vs. Legacy.  The fix is a templated lambda

`[&]<int DI, int DJ, int DK>() [[gnu::always_inline]]` dispatched via
`.template operator()<...>()`.  Standalone processTap symbol vanishes;
transposed body grows 4.4 → 9.8 KB, matching Legacy's 10.5 KB.

Measured at ~32M active voxels on i9-285K (24 threads):
  - Narrowband taperLER: Legacy 2.2 ns/vox vs Transposed 2.1 ns/vox
    (marginal, within the ~10 % noise floor)
  - Synthetic 64M/50%:   Legacy 2.4 ns/vox vs Transposed 2.8 ns/vox
    (+19 %, outside noise)

Implementation verdict: LegacyStencilAccessor's voxel-outer moveTo
stays the default.  Tap-outer has no consistent perf advantage, and
voxel-outer wins on cleanliness (self-contained accessor, no scratch
arrays, no compiler-inlining fragility, 1:1 mapping to the stencil-
operator mental model).  `legacy-transposed` kept as a benchmark pass
for future reference.

BatchAccessor.md §8l captures the experiment, the inlining-pitfall
lesson, the measurement matrix, and the implementation-quality
rationale behind keeping voxel-outer as the production default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…WENO pipeline target

Our Weno5Stencil in StencilAccessor.h was missing the center tap
<0,0,0>, defining only the 18 off-center ±1,±2,±3 per-axis taps.  The
canonical WenoStencil in nanovdb/math/Stencils.h has 19 taps with
WenoPt<0,0,0>::idx = 0, followed by x-axis (1..6), y-axis (7..12),
z-axis (13..18).  Fixed our ordering to match exactly, so any caller
using the canonical WenoPt index convention gets the same slot layout
as our Weno5Stencil::Taps.

The Hull is unchanged: the center never crosses a leaf, so it contributes
nothing to the face-neighbour prefetch set (§4b monotonicity
argument).

Code impact is data-driven (SIZE = tuple_size_v<Taps>): the
StencilAccessor/LegacyStencilAccessor mIndices storage grows from
[18][W] to [19][W], the parameter-pack fold in fillTaps now expands to
19 calls, and the tap-outer `legacy-transposed` pass iterates 19 inner
loops.  No hard-coded 18s in the benchmarks.  Cross-path checksums
still match byte-for-byte (hybrid vs legacy vs transposed), now with
the center voxel's index included:
  - synthetic 64M/50%: 0x0000000001e1b860 (was 0x0000000000b2ce4fc)
  - narrowband taperLER: 0x000000001a38de2b (was 0x000000001f8ad71a)

The rough 19/18 ratio shows up in the timings: narrowband slowed ~2-5%
across all three implementations (within the earlier noise floor); the
synthetic transposed pass ticked up more (~13%) but stays within the
already-observed voxel-outer-vs-tap-outer spread.

---

Also documented the end-to-end CPU WENO5 target pipeline in
BatchAccessor.md §11.  This captures the three-phase structure the
experimentation in §5-§8 has been building toward:

  (1) Decode inverse maps (shipped)
  (2) Per-batch sidecar value assembly — the next phase:
      float mValues[Ntaps][W] via sidecar lookup, per W=4/8 batch
      (per-batch scope to keep mValues L1-resident, ~9.5 KB block-wide
      would be too big)
  (3) Full SIMD WENO over Simd<float, W>

Documented the sign-extrapolation convention for out-of-band taps
(magnitude = |sidecar[0]|, sign inherited from the next-inner tap on
the same axis), its loop-order implications (tap-outer / axis-major
/ ascending-|Δ|), and two open questions (distance-1 sign source,
cascade transitivity).

Flagged that the §8l voxel-outer-vs-tap-outer measurements do not
settle the Phase-2 loop order question: they were taken at
BlockWidth=128 inner-loop size over uint64 indices; the real Phase-2
inner loop will be W=4 or W=8 over floats.  Output-layout-match and
sign-extrap dependency both push toward tap-outer at that scale, but
a rerun at the real batch width is required before locking it in.

Plus small 18→19 comment sweeps in the live source; historical
measurement narratives in §8c-§8k left alone (they describe what was
measured at 18 taps at the time and don't update automatically).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Restructured the `legacy`, `framing`, and `legacy-transposed` passes in
both examples so that after `decodeInverseMaps` the work runs as an outer
`for batchStart in 0..BlockWidth step SIMDw` loop with an inner lane loop
over SIMDw=16.  The StencilAccessor pass was already batched (reference
structure); this aligns every other pass with it.

Notes:
- LegacyStencilAccessor / ReadAccessor instances stay outside the batch
  loop (one per TBB task) — same amortisation as before.
- `legacy-transposed`: `centers[]` and `s[]` scratch arrays shrunk from
  BlockWidth=128 to SIMDw=16; the templated `processTap` lambda's inner
  loop also runs over SIMDw lanes.

This is Stage 1 of the Phase-2 pipeline preparation (BatchAccessor.md
§11): Phase 2 will slot a sidecar-aware value assembly + SIMD WENO
kernel into the same batch loop, so every path needs that shape in place
first.

Perf impact (24-thread, i9-285K):
- StencilAccessor: unchanged (already batched).
- Legacy:          ~5 % slower on narrowband, noise-level on synthetic.
- Legacy-transposed: ~6-7 % slower both workloads — the per-batch
  `processTap` lambda fires 8x instead of once per block.

Checksums match byte-for-byte across all three passes on both workloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stage 2 of the Phase-2 pipeline preparation (BatchAccessor.md §11).
Three new benchmark passes that assemble per-batch value + activity
matrices from the sidecar, plus a stand-in token op for anti-DCE:

  float values  [SIZE][SIMDw]   -- sidecar[idx]     (idx==0 -> background)
  bool  isActive[SIZE][SIMDw]   -- (idx != 0)

Token op: per active voxel, sum values[k][i] over taps with
isActive[k][i]==true, write to a second sidecar at the voxel's
VBM-sequential index (firstOffset + bID*BlockWidth + lane), which
equals the center voxel's ValueOnIndex by VBM construction.

Supporting changes:
- convertToIndexGridWithSidecar: set sidecar[0] = floatGrid.background()
  after copyValues, so sidecar[idx] is unconditionally valid.
- runPerf: takes `const std::vector<float>& sidecar`; allocates a local
  outputSidecar (same shape as input) for the new passes.

Pass variants:
- sidecar-legacy:     LegacyStencilAccessor (scalar moveTo per voxel)
- sidecar-stencil:    StencilAccessor       (hybrid SIMD+scalar moveTo
                      per batch, reads mIndices[k][i] directly)
- sidecar-transposed: tap-outer ReadAccessor probeLeaf + getValue

All three produce identical checksums (cross-validation):
- checksum 0xcfbff7c8 on taperLER.vdb

End-to-end timings (24 threads, i9-285K):
  sidecar-legacy:     130 ms  (4.1 ns/voxel)
  sidecar-stencil:     97 ms  (3.0 ns/voxel)  -- best
  sidecar-transposed: 126 ms  (4.0 ns/voxel)

StencilAccessor's SIMD moveTo amortises both the direction-decode and
the per-lane scalar work; its tap-outer contiguous mIndices[k][i] rows
also let the sidecar gather and value-array fill vectorise cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…polation

New header nanovdb/nanovdb/util/WenoStencil.h defines WenoStencil<W>,
a 19-tap WENO5 stencil value container parameterised on SIMD lane width.
Same storage declaration holds for both scalar (W=1, GPU) and SIMD
(W>1, CPU batched) instantiations via std::conditional_t:

  W == 1 :  ValueT = float           PredT = bool
  W >  1 :  ValueT = float[W]        PredT = bool[W]

Storage is plain C arrays (mValues[SIZE], mIsActive[SIZE]) so callers
fill lane-by-lane with the same syntax in both cases; an internal
addr() helper bridges the W=1 scalar-reference / W>1 array-decay
asymmetry for SIMD load/store inside extrapolate().

extrapolate(absBackground) repairs out-of-band lanes in-place: for each
tap k with mIsActive[k][i] == false, writes copysign(absBackground,
mValues[innerTap][i]) via a hardcoded ascending-|Δ| cascade (18 pairs):

  |Δ|=1 taps  <--  center (0,0,0)
  |Δ|=2 taps  <--  |Δ|=1 on same axis
  |Δ|=3 taps  <--  |Δ|=2 on same axis

Cascade order guarantees the inner tap is already resolved when the
outer tap is processed, so sign propagates through |Δ|=1 -> |Δ|=2 ->
|Δ|=3 without special casing.

Implementation uses only Simd.h primitives (Simd<float,W>, SimdMask,
where, operator>, unary minus, store) which collapse cleanly to scalar
code under W=1 fixed_size<1>.  No Simd.h changes needed.

Also adds a new benchmark pass `sidecar-stencil-extrap` in
ex_narrowband_stencil_cpu: same fill as sidecar-stencil, then calls
stencil.extrapolate(|sidecar[0]|), token sum over ALL 19 taps
(previously sum was gated by isActive).

Results on taperLER.vdb (24 threads, i9-285K):

  Sidecar (stencil)       :  97.3 ms  (3.1 ns/voxel)  checksum=0xcfbff7c8
  Sidecar (stencil+extrap): 101.8 ms  (3.2 ns/voxel)  checksum=0x371273d0

Extrap cost: +4.5 ms / 31.8M voxels = ~0.14 ns/voxel (~50 cycles for
18 SIMD blend pairs).  The checksum for the extrap pass is stable run-to-run;
differs from sidecar-stencil as expected since the sum now includes
extrapolated values instead of excluding out-of-band lanes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New WenoStencil.md captures the Stage-3 design rationale for
nanovdb/nanovdb/util/WenoStencil.h:

- §2  single-source scalar/SIMD via std::conditional_t storage
      (ValueT = float vs float[W], PredT = bool vs bool[W])
- §3  the addr() bridge for W=1 scalar-ref vs W>1 array-decay at
      SIMD load/store sites; rejected alternative (Simd.h reference
      overloads) and why
- §4  extrapolation semantics (OOB problem, ascending-|Δ| cascade,
      hardcoded kPairs table for Weno5)
- §5  extrapolate() implementation walkthrough + per-pair cost model
- §6  API usage (fill / extrapolate / named-tap access)
- §7  future work: reconstruct() method, Weno5Stencil policy
      consolidation, generic StencilPolicy parameterisation
- §8  relation to BatchAccessor.md / StencilAccessor.md / HaloStencilAccessor.md

BatchAccessor.md §11.6 (Implementation status) added to the
"Target pipeline" section summarising Stage 1 (batch-restructure),
Stage 2 (sidecar variants), and Stage 3 (WenoStencil + extrapolate)
with commit SHAs, perf tables, and a trimmed "Remaining" list
pointing at the Phase-3 WENO5 arithmetic kernel as the next step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ffers

Refactor WenoStencil<W> to hold first-class Simd values directly, rather
than raw float[W]/bool[W] C arrays with an addr() bridge at every SIMD
load/store site.  The class becomes pure-compute: no fill-side storage,
no addr() ceremony, no element_aligned tags inside the kernel bodies.
The caller owns whatever scalar-scatter target shape is natural for its
source (typically a pair of stack-local alignas(64) C arrays at CPU
W>1; no intermediate buffer at all on CUDA W=1).

  FloatV values  [SIZE];      // Simd<float, W>     — first-class storage
  MaskV  isActive[SIZE];      // SimdMask<float, W>
  float  mDx2, mInvDx2;       // scalar grid constants, broadcast on use

At W=1 the Simd types collapse to plain float/bool under the array
backend — the CUDA per-thread code path reads as pure scalar
arithmetic, which was the primary motivation for the refactor:
scalar is the production workhorse on GPU, not a degenerate
convenience of the CPU SIMD design.

The earlier Approach-1 design (std::conditional_t<W==1, float, float[W]>
+ addr() helper) forced W=1 to load `Simd<float, 1>` via address-of-
scalar, which propagated SIMD-ish ceremony into every __hostdev__
method.  Approach 2 sidesteps this: at W=1 there is no Simd wrapper at
the storage level at all.

Also adds `normSqGrad(iso = 0.f)`, the fifth-order upwind Godunov
norm-square-gradient method.  Line-for-line transliteration of the
scalar ground-truth `WenoStencil<GridT>::normSqGrad(isoValue)` in
`nanovdb/math/Stencils.h`, adapted to generic-T:

  - Six axial WENO5 reconstructions via free-function
    `nanovdb::detail::WENO5<T>` — mirrors `nanovdb::math::WENO5<T>`
    identically, wrapping integer literals in RealT() for Simd
    broadcast.
  - `nanovdb::detail::GodunovsNormSqrd<T, MaskT>` computes both
    outside/inside branches unconditionally and blends via
    util::where; replaces the scalar ground-truth's runtime if/else
    so the SIMD path has no control-flow divergence across lanes.
  - Final scalar mInvDx2 broadcast to FloatV at the single
    combinator multiplication (free on x86; identity at W=1).
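A scalar sketch of the branchless Godunov upwinding described in the second bullet: both the outside and inside candidates are computed unconditionally, then blended on the sign predicate (the scalar analogue of util::where; the function name and one-axis framing here are assumptions for illustration):

```cpp
#include <algorithm>
#include <cassert>

// One-axis Godunov upwind contribution to |∇φ|².  dM/dP are the
// one-sided minus/plus derivatives; `outside` is the φ > iso predicate.
// Both branches are evaluated so a SIMD version has no lane divergence.
inline float godunovAxisSq(float dM, float dP, bool outside)
{
    const float z = 0.f;
    float out = std::max(std::max(dM, z) * std::max(dM, z),
                         std::min(dP, z) * std::min(dP, z));
    float in  = std::max(std::min(dM, z) * std::min(dM, z),
                         std::max(dP, z) * std::max(dP, z));
    return outside ? out : in;   // where(outside, out, in) in SIMD form
}
```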

sidecar-stencil-extrap pass in ex_narrowband_stencil_cpu updated to
the new API:

  - Caller allocates stack-local raw_values[SIZE][SIMDw] /
    raw_active[SIZE][SIMDw] inside the TBB task body.
  - Scalar scatter fill populates the raw buffers from the sidecar
    via stencilAcc.mIndices[k][i].
  - SIMD load-per-tap moves the data into stencil.values[k] /
    stencil.isActive[k].
  - stencil.extrapolate(absBackground) operates in-place on the
    Simd values.
  - The token sum over 19 taps now runs as a pure Simd accumulate,
    then stores to a stack-local sum_lanes[SIMDw] for the per-voxel
    scalar write to the output sidecar (gated by leafIndex).

Checksum matches byte-for-byte (0x371273d0 on taperLER.vdb), timing
within noise of the pre-refactor baseline (~82 ms / 2.6 ns/voxel
end-to-end on 24 threads).

WenoStencil.md rewritten: §2 now describes Simd-typed storage and
the caller-owned fill-buffer convention (the addr() bridge section
is removed); §4 documents normSqGrad including the generic-T
WENO5 / GodunovsNormSqrd helpers and their relationship to the
scalar ground-truth; §5.1/§5.2 show the two usage patterns (CPU
SIMD with raw buffers + explicit load; CUDA scalar with direct
per-tap assignment); §6 diagrams the ownership boundaries; §7.1
flags that normSqGrad perf hasn't been measured yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…ne pass

Adds the full Phase-2+3 WENO5 pipeline as a new benchmark pass:
decode → sidecar fill → load into WenoStencil<W> → extrapolate →
normSqGrad → store |∇φ|² to output sidecar.

The pass replaces the debug "sum over 19 taps" intermediate of
sidecar-stencil-extrap with the real Phase-3 arithmetic — what goes
into the output channel is now the actual norm-square-of-gradient,
not an anti-DCE token.  Existing sidecar-stencil-extrap kept as a
perf-decomposition waypoint that isolates the cost of load+extrapolate
alone.

Grid voxel size comes from grid->voxelSize()[0] (isotropic assumption
for narrow-band level sets); iso = 0 for zero-crossing semantics.

Measurements on taperLER.vdb (31.8M active voxels, 24 threads, i9-285K):

  Sidecar (stencil)         :  83.6 ms  (2.6 ns/voxel)  fill + debug sum
  Sidecar (stencil+extrap)  :  78.5 ms  (2.5 ns/voxel)  + extrapolate
  Sidecar (+normSqGrad)     :  97.3 ms  (3.1 ns/voxel)  + full WENO5 + Godunov

Phase-3 arithmetic cost isolated:  18.8 ms / 31.8M vox = 0.59 ns/voxel
end-to-end on 24 threads.  Per-core per-voxel: ~14 ns, roughly matching
the expected ~100-FMA WENO5+Godunov budget at AVX2 throughput.

Full Phase-2+3 end-to-end over decode:  90.7 ms = 2.85 ns/voxel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
… arithmetic

Assembly analysis of the sidecar-stencil-normsqgrad pass revealed that
GCC was outlining tiny one-instruction wrappers despite their `inline`
declaration.  The cost model flagged them "too heavy" because the
calling convention passes 256-bit YMM arguments by value, regardless of
how trivial the body actually is.  Result:

  - util::min / util::max / util::where on Simd<float,16> emitted as
    weak symbols with 6-register prologue/epilogue + vzeroupper per
    call.  GodunovsNormSqrd made 20 such calls per voxel batch.
  - detail::WENO5 and detail::GodunovsNormSqrd also emitted as weak
    out-of-line, called 6+1 times from WenoStencil<16>::normSqGrad.
  - WenoStencil<16>::normSqGrad and ::extrapolate also out-of-line,
    called from the pass lambda body.

Each outlining layer forced a vzeroupper (AVX→SSE state transition
required by the x86 ABI when the callee is not marked VZEROUPPER-aware).
Per 16-voxel batch the normSqGrad path incurred ~26 vzeroupper + 8
function calls; over 2M batches × 24 threads this dominated what should
have been a tight inlined SIMD block.

Fix: [[gnu::always_inline]] on eight sites.

  Simd.h — the stdx-backend wrappers at namespace nanovdb::util:
    - min<T,W>(Simd, Simd)
    - max<T,W>(Simd, Simd)
    - where<T,W>(SimdMask, Simd, Simd)   (homogeneous)
    - where<T,U,W>(SimdMask<U>, Simd<T>, Simd<T>)  (heterogeneous)

  WenoStencil.h:
    - detail::WENO5<T,RealT>
    - detail::GodunovsNormSqrd<T,MaskT>
    - WenoStencil<W>::extrapolate
    - WenoStencil<W>::normSqGrad
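The attribute pattern looks like the following (a generic illustration, not the actual Simd.h signatures): [[gnu::always_inline]] overrides GCC's inlining cost model, which otherwise treats by-value YMM parameters as heavy and outlines even one-instruction bodies behind a call plus vzeroupper.

```cpp
#include <cassert>

// Hypothetical one-instruction wrapper of the kind GCC was outlining;
// the attribute forces inlining regardless of the cost model's estimate.
template <typename T>
[[gnu::always_inline]] inline T myMin(T a, T b) { return a < b ? a : b; }
```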

With the attribute applied, all four weak symbols (WENO5, Godunov,
extrapolate, normSqGrad) disappear from the object file.  The pass #8
(sidecar-stencil-normsqgrad) body grows from 1693 B to 9269 B as the
full Phase-2+3 pipeline collapses into one inlined block.  Hot-path
instruction mix (stdx backend, -O3 -march=native, AVX2+FMA3, Arrow Lake):

  518 packed SIMD arithmetic ops:
    218 vmulps + 128 vaddps + 72 vsubps + 48 vdivps + 24 vmaxps
    160 FMAs: 71 vfmadd231ps + 36 vfmadd132ps + 29 vfnmadd132ps
              + 24 vfnmadd231ps
  28 vbroadcastss (scalar constants to YMM)
  1160 / 2044 instructions use ymm (57 %)
  3 calls remaining: decodeInverseMaps, StencilAccessor::moveTo,
                     __stack_chk_fail.
  3 vzeroupper total (down from ~26 scattered across the outlined call
    tree).

Measured on taperLER.vdb (31.8 M active voxels, 24 threads, i9-285K):

  Before:  Sidecar (+normSqGrad) = 97.3 ms = 3.1 ns/voxel
  After:   Sidecar (+normSqGrad) = 82.4 ms = 2.6 ns/voxel
  Delta:   −15 % end-to-end.

Isolating the Phase-3 arithmetic alone (normSqGrad pass minus
sidecar-stencil-extrap pass, which stops at the extrap step):

  Before:  18.8 ms / 31.8 M vox = 0.59 ns/voxel
  After:    2.7 ms / 31.8 M vox = 0.09 ns/voxel
  Delta:   −86 %, i.e. 7× faster on the WENO5 + Godunov arithmetic.

Checksum changed 0x4371e374 → 0x438f725f.  Expected: with the helpers
inlined at a single source location, GCC fuses more mul+add pairs into
FMAs, giving slightly different FP-rounding behaviour.  Both outputs
are equally correct; the difference is non-deterministic FP-arithmetic
artifacts of instruction scheduling, not a semantic change.

No regressions elsewhere: every other pass (StencilAccessor, Legacy,
sidecar-{legacy,stencil,transposed}, legacy-transposed) stayed within
~2 % of its previous timing on this run.

Background: this is the analogue for Phase-3 of the fix documented in
BatchAccessor.md §8h for the StencilAccessor::moveTo path, where
[[gnu::flatten]] on the moveTo() call site served the same purpose
(force-inlining Simd.h helpers that GCC outlines despite `inline`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…lidation

New standalone example that exercises the complete CPU WENO5 pipeline
on a narrow-band level set loaded from a .vdb file and validates the
result against a scalar reference:

  reference : per-voxel scalar nanovdb::math::WenoStencil<NanoGrid<float>>
              normSqGrad, driven by a TBB-parallel leaf walk.  Tile values
              at internal nodes and in-leaf inactive values are preserved
              verbatim during the openvdb::FloatGrid -> NanoGrid<float>
              conversion, so the stencil's ReadAccessor fetches correctly-
              signed values for taps outside the narrow band (the same
              convention OpenVDB's narrow-band level-set builder sets up,
              which matches the semantics of our explicit extrapolate()).

  fast      : VBM decode -> LegacyStencilAccessor scalar gather ->
              per-tap SIMD load into WenoStencil<SIMDw> -> extrapolate()
              -> normSqGrad() -> Simd->scalar bridge -> per-voxel scalar
              write to outputFast.  No hybrid SIMD StencilAccessor;
              voxel-outer LegacyStencilAccessor chosen for code clarity.

Cross-path indexing: both passes write to a sidecar-shaped output
buffer keyed by ValueOnIndex slot (slot 0 = background, slots 1..N =
active voxels in NanoVDB ordering).  The index grid's accessor resolves
`ijk -> slot` in the reference pass; VBM's firstOffset + bID*BlockWidth
+ lane gives the same slot in the fast pass, by VBM construction.
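The fast-path slot arithmetic can be sketched as follows (the block width of 512 and the parameter names are assumptions mirroring the commit's description, not the library's actual API):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kBlockWidth = 512; // assumed voxels per VBM block

// Slot 0 is the background; active voxels occupy slots 1..N in NanoVDB
// ordering, so the fast path indexes the sidecar linearly per block
// while the reference pass resolves the same slot via ijk lookup.
inline std::uint64_t vbmSlot(std::uint64_t firstOffset,
                             std::uint64_t bID, std::uint64_t lane)
{
    return firstOffset + bID * kBlockWidth + lane;
}
```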

After both passes, prints a log-decade histogram of |outputRef[i] -
outputFast[i]| across all active voxels, plus max/mean deltas.  On
taperLER.vdb (31.8M active voxels, 24 threads, i9-285K):

  reference (scalar): 120.4 ms  (3.80 ns/voxel)
  fast  (VBM+SIMD)  :  91.3 ms  (2.87 ns/voxel)   speedup: 1.3x

  |Delta| histogram:
    [0,      1e-10) : 23,431,571 (73.75%)    bit-exact match
    [1e-10,  1e-9 ) :          2 ( 0.00%)
    [1e-9,   1e-8 ) :          2 ( 0.00%)
    [1e-8,   1e-7 ) :  1,122,483 ( 3.53%)    single-ULP-class noise
    [1e-7,   1e-6 ) :  7,218,133 (22.72%)    FMA-fusion-class noise
    [1e-6,   inf  ) :          0 ( 0.00%)

  max  |Delta| = 9.5e-7  (at slot 12,390,855:
                          ref=2.28233, fast=2.28234)
  mean |Delta| = 4.1e-8
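The log-decade binning behind that histogram reduces to a small edge scan; a sketch (bin edges taken from the printed table, function name hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Map |outputRef[i] - outputFast[i]| to its log-decade bin:
// 0 = [0,1e-10), 1 = [1e-10,1e-9), ..., 5 = [1e-6, inf).
inline std::size_t deltaBin(float delta)
{
    const float edges[] = {1e-10f, 1e-9f, 1e-8f, 1e-7f, 1e-6f};
    const float a = std::fabs(delta);
    std::size_t b = 0;
    for (float e : edges) { if (a < e) break; ++b; }
    return b;
}
```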

No voxels disagree above 1e-6.  Our explicit extrapolate() plus SIMD
normSqGrad reproduces the scalar OpenVDB/NanoVDB tile-extrapolation
ground truth to within FP-rounding tolerance — the sign-cascade rule
(kPairs table in WenoStencil.h) is consistent with OpenVDB's
topological tile-value convention on typical narrow-band SDFs.

End-to-end 1.3x speedup reflects that both paths pay a roughly equal
sidecar-gather cost; the 7x Phase-3-arithmetic speedup documented in
BatchAccessor.md §11.6 shows up as only a fraction of total time
because gather dominates (60% of fast-path wall clock) and is still
scalar-scatter-shaped in both paths.  A SIMD-gather front-end (the
hybrid StencilAccessor path) would widen this, at the cost of the
library complexity the clean demonstrator deliberately avoids.

Library dependencies exercised by the fast path:
  - LegacyStencilAccessor.h     (per-voxel scalar gather)
  - WenoStencil.h               (Simd compute container)
  - Simd.h                      ([[gnu::always_inline]] helpers)
  - branchless LeafData::getValue (default body of getValue)
  - Weno5Stencil::Taps          (tap tuple; currently in StencilAccessor.h)

Not exercised — candidates for demotion or tighter doc-scoping:
  - StencilAccessor.h + BatchAccessor.h (hybrid SIMD gather path)
  - HaloStencilAccessor.md      (speculative halo gather)

CMakeLists.txt hook mirrors ex_narrowband_stencil_cpu: OPENVDB-gated,
-march=native -fopenmp-simd, no CUDA dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
.#weno_nanovdb_cpp.cpp — rather, .#weno_nanovdb_cpu.cpp — is an Emacs
per-user lock artifact that got swept up by `git add` in the previous
commit.  It's a dangling symlink to esifakis@host:pid, not a meaningful
repo artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>